Abstract—This article proposes the modified KNN (K Nearest
Neighbor) algorithm which receives a graph as its input
data and is applied to the text categorization. The graph is 
more graphical for representing a word and the synergy effect 
between the text categorization and the word categorization 
is expected by combining them with each other. In this 
research, we propose the similarity metric between two graphs 
representing words, modify the KNN algorithm by replacing 
the existing similarity metric by the proposed one, and apply
it to the text categorization. The proposed KNN is empirically 
validated as the better approach in categorizing texts in news 
articles and opinions. In this article, a word is encoded into a 
weighted and undirected graph and it is represented into a list 
of edges. 


I. INTRODUCTION 


Text categorization refers to the process of classifying 
each text into its relevant topics or categories among the 
predefined ones. As its preliminary tasks, a finite number 
of categories are predefined and sample texts which are 
labeled with one or some of the predefined are prepared. 
As the learning process, using the sample labeled texts, the 
classification capacity is constructed. Subsequent texts which 
are given as ones separated from the sample labeled texts are 
classified as the generalization process. Even if other kinds 
of approaches such as manual rule based schemes and other 
heuristic ones are available, in this research, we assume that 
the supervised learning algorithms are used as the approach. 

Let us mention some points which provide the motiva- 
tions for doing this research. Encoding texts into numerical 
vectors causes problems such as huge dimensionality and 
sparse distribution [3]. The graphs became the popular
representations of knowledge or information which are called
ontologies or word nets [1][17]. Because the ontologies
are used for representing knowledge as graphs, many
algorithms for manipulating graphs were developed in previous
works. Therefore, motivated by these points, we encode texts
into graphs, and modify the machine learning algorithms into
versions which receive graphs as input data.

Let us mention what we propose in this research as its 
ideas. Instead of a numerical vector, each text is encoded 
into the graph where its vertices are words and its edges are 
the semantic relations among words. The similarity measure 
between two graphs is defined, considering both the vertices 
and edges. The KNN (K Nearest Neighbors) is modified 


into the graph based version where data items are classified 
based on the similarity between graphs, and applied to the 
text categorization tasks. The adjacency matrix is adopted 
as representation of each graph in this research. 

Let us mention some benefits which are expected from this 
research. By avoiding the problems from encoding texts into 
numerical vectors, we expect the better performance of the 
proposed version than the traditional version of the KNN. 
Since the graphs are more symbolic text representations than 
numerical vectors, we expect more transparency in encod- 
ing so. We expect more compactness of representing texts 
than numerical vectors for processing texts more efficiently. 
Hence, the goal of this research is to implement the text
categorization which satisfies these benefits.

This article is organized into the five sections. In Section 
II, we survey the relevant previous works. In Section III, 
we describe in detail what we propose in this research. In 
Section IV, we validate empirically the proposed approach 
by comparing it with the traditional one. In Section V, we 
mention the general discussion on the empirical validations 
and remaining tasks for doing the further research. 


II. PREVIOUS WORKS 


This section is concerned with the previous works which 
are relevant to this research. In Section II-A, we explore the
previous cases of applying the KNN algorithm to text mining 
tasks. In Section II-B, we survey the schemes of encoding 
texts or words into structured data. In Section II-C and II-D, 
we survey the previous works on the two kinds of non- 
numerical vector based machine learning algorithms: table 
based machine learning algorithms and string vector based 
machine learning algorithms. Therefore, in this section, we 
provide the history about this research, by surveying the 
relevant previous works. 


A. Related Tasks 


This section is concerned with the previous cases of 
applying the modernized machine learning algorithms for 
the text categorization and its related tasks. We mention 
the word categorization to which the modernized KNN 
algorithm is applied, as a task which is related with the 
text categorization. We present the cases of applying the 
modernized KNN algorithm to the text categorization which 


is covered as the challenge of this research. We consider 
the text clustering where the modernized AHC algorithm is 
applied as another related task. This section is intended to 
survey the cases of applying the modernized KNN algorithm 
and the modernized AHC algorithm, for the text categoriza- 
tion and its related tasks. 


Let us mention the previous cases of applying the graph 
based KNN version to the word categorization. In 2006, Jo 
initially proposed the modification of the KNN algorithm 
into its graph based version as an approach to the word 
categorization [20]. In 2018, the modernized version was 
compared with the traditional version as the start of observ- 
ing its better performance in the word categorization [26]. 
In 2018, the better performance of the modernized version
was completely validated in categorizing words in the three
text collections [27]. In the above literature, we observe the
previous cases of using the modernized version of the KNN
algorithm for the word categorization.


Let us survey cases of applying the KNN algorithm which 
processes graphs directly as a modernized version for the 
text categorization which is covered in this research. The 
proposed version of the KNN algorithm is initially asserted 
as the approach to the text categorization by Jo in 2018 
[28]. Its better performance than the traditional version was 
discovered in classifying texts in a small text collection in 
2019 [33]. This research is aimed to finalize validating the 
better performance of the version which receives a graph 
as its input data, in the text classification. In the above 
literatures, we mention the graph based KNN algorithm 
which is used as an approach to the text categorization. 


Let us explore the previous works where the graph based 
AHC algorithm is applied for clustering texts. The graph 
based version was initially asserted as an approach to the 
text clustering by Jo in 2017 [22]. He started to observe its 
better performance than the traditional AHC algorithm in a 
toy experiment, in 2019 [34]. The empirical validation of 
the better performance was finalized by real experiments in 
2020, but not published, yet [36]. The metric which is used 
for evaluating the clustering algorithms in those experiments
was proposed by Jo and Lee in 2007 [6]. 


We explored the previous cases of applying the proposed 
version of KNN algorithm to the tasks which are relevant 
to this research. The text categorization which is covered in 
this research is aimed to assign topics to texts, depending on 
their contents. The KNN version which is adopted in this 
research processes graphs, directly, and it was applied to 
the word categorization, as well as the text categorization. 
The AHC algorithm which was applied to the text clustering 
was modified in the previous works by the same style of 
doing the KNN algorithm. The goal of this research is to 
validate completely the better performance of the proposed 
KNN algorithm as the approach to the text categorization 
through the real text collections. 


B. Encoding Schemes 


This section is concerned with the various schemes of en- 
coding texts into structured data. In this research, we propose 
that texts should be encoded into graphs as structured data. 
We mention other structured data, such as numerical vectors, 
tables, and string vectors, in surveying previous works. In 
the previous works which are explored in this section, we 
will present the modified versions of the KNN algorithm 
which process the structured data, directly. This section is 
intended to survey the previous works on encoding texts into 
three kinds of structured data. 

Let us review the previous cases of encoding words or 
texts into numerical vectors. In 2018, texts were encoded 
into numerical vectors, in using the AHC algorithm for 
clustering texts [29]. In 2019, words were encoded into nu- 
merical vectors, in using the KNN algorithms for classifying 
them [30]. In 2019, texts were encoded so in using the KNN 
algorithm for classifying them [35]. The similarity between 
numerical vectors is computed by considering the feature 
similarities and the feature value similarities, to prevent the 
poor discriminations among sparse vectors. 

Let us survey the previous works where texts are encoded 
into tables. In 2008, Jo and Cho initially tried to encode 
texts into tables in the text categorization [13]. In 2008, 
texts were encoded so and the online clustering algorithm 
was modified as the approach to the text clustering [9]. 
In 2015, Jo proposed the table matching algorithm where 
texts are encoded into tables as the approach to the text 
categorization [19]. In the above literatures, we presented 
the previous cases where texts are encoded into tables. 

Let us mention the previous cases of encoding a text 
into a string vector as an ordered finite set of strings. In 
2018, texts were encoded into string vectors for modifying 
the KNN algorithm into the string vector based version as 
the approach to the text categorization [31]. In 2018, the 
text summarization is viewed into the classification of each 
paragraph into summary or non-summary, and the string 
vector based version of the KNN algorithm is applied to the 
task [32]. The AHC algorithm is modified as the approach 
to the text clustering into the version where a text is encoded 
into a string vector, in 2020 [37]. In the above literatures, 
we present the cases of encoding texts into string vectors 
for modifying the KNN algorithm and the AHC algorithm. 

We surveyed the previous works on the schemes of en- 
coding texts into structured forms. Texts were encoded into 
numerical vectors, and the similarity metric which considers 
the feature similarities was proposed. Texts were encoded 
into tables and the similarity metric between tables based on 
their shared entries was proposed. Texts were encoded into 
string vectors, and the semantic similarity between them was 
proposed for modifying the KNN algorithm and the AHC 
algorithm as the approaches to the text mining tasks. In this 
research, texts are encoded into graphs, and the similarity
metric between two graphs which is described in Section
III-B is proposed.


C. Table based Machine Learning Algorithms 


This section is concerned with the previous works on the 
table based approaches to text mining tasks. We will present 
the classification algorithm and the clustering algorithm 
which processes tables, instead of numerical vectors. We will 
mention the table based matching classification algorithm, 
the table based matching clustering algorithm, and the table 
based KNN algorithm, as the kind of the non-numerical 
vector based machine learning algorithms. The significance 
of the previous works which are surveyed in this section is 
to try to solve the problems in encoding texts into numerical 
vectors, such as huge dimensionality, sparse distribution, and 
the poor transparency. This section is intended to explore the 
previous works on the three table based algorithms as the 
approaches to the text mining tasks. 

Let us survey the previous works on the table based 
machine learning algorithm as an approach to the text categorization.
In 2008, Jo and Cho initiated solving the problems in 
encoding texts into numerical vectors by proposing initially 
the table based matching algorithm [13]. It was applied to 
the soft text categorization where each text is allowed to 
be classified into more than one category, in 2008 [9]. It 
was improved and stabilized as the approach to the text 
categorization, in 2015 [19]. In the above literatures, we 
present the table based matching algorithm for avoiding the 
problems in encoding texts into numerical vectors. 

Let us survey the previous works on the clustering al- 
gorithm which processes tables, directly. The table based 
matching algorithm was initially applied to the text
clustering, as well as the text categorization, in 2017 [8].
Its performance was validated in the real text collection, 
20NewsGroup, in 2008 [14]. The online linear clustering 
algorithm was modified into the table based version as the 
approach to the text clustering, in 2008 [10]. In the above 
literatures, we presented the table based clustering algorithm 
which clusters tables, instead of numerical vectors. 

Let us explore the previous works on the table based 
KNN algorithm as a non-numerical vector based classifier. 
It was proposed as the approach to the text categorization by 
defining the similarity between tables as one between texts, 
in 2017 [23]. The version of the KNN algorithm was applied 
to the text summarization which is mapped into an instance 
of text categorization, in 2017 [24]. It was applied to the text 
segmentation as one more text categorization instance, in 
2017 [25]. In the above literatures, we presented the proposal 
of the table based KNN algorithm as a non-numerical vector 
based classification algorithm and its applications to the text 
categorization instances. 

We surveyed the previous works on the table based 
machine learning algorithms as the approaches to the text 
mining tasks. The table based matching algorithm was 


proposed and stabilized as an approach to the text cate- 
gorization. It was applied to the text clustering, as well 
as the text categorization, as a clustering algorithm. The 
KNN algorithm was modified into the table based version 
which processes tables, directly. In this research, the KNN 
algorithm was modified into the graph based version which 
processes graphs, directly, as the alternative one to the table 
based version. 


D. String Vector based Machine Learning Algorithms 


This section is concerned with the previous works on 
the string vector based machine learning algorithms which 
are the approaches to the text categorization and the text 
clustering. A string vector is defined as an ordered finite 
set of strings; numerical values are replaced by strings as 
the elements in a vector. The SVM with the string vector 
kernel function, the NTC (Neural Text Categorizer), and the 
NTSO (Neural Text Self Organizer) will be mentioned as the
typical string vector based machine learning algorithms in
surveying the previous works. They are used for categorizing 
and clustering texts in the previous works. This section is 
intended to explore the previous works on the three string 
vector based machine learning algorithms. 

Let us survey the previous works on the string vector 
kernel function which indicates the similarity between two 
string vectors. The string vector kernel was initially defined 
and implemented based on the inverted index where each 
word is linked with texts which include itself, in 2007 
[5]. The string vector kernel is implemented by defining 
the similarity matrix as a square matrix which consists of 
semantic similarities between words, in advance, in 2007 [7]. 
The string vector kernel was used for modifying the SVM 
into its string vector based version as the approach to the 
text categorization [11]. In the above literatures, the string 
vector kernel was defined as the similarity between string 
vectors, and the SVM was modified using it. 

Let us explore the previous works on the NTC (Neural
Text Categorizer) as a string vector based neural network.
It was initially created and applied to the text categorization
by Jo in 2008 [12]. Its better performance was empirically
validated in both the hard text categorization and the soft
text categorization, in 2010 [15]. The NTC was applied for
classifying Arabic texts by Abainia et al., in 2015 [18],
and mentioned as an innovative neural network by Vega
and Mendez-Vazquez, in 2016 [21]. In the above literature,
the proposal, the application, and the citation of the NTC
are presented.

Let us survey the previous works on the NTSO (Neural
Text Self Organizer) as another string vector based neural
network. It was initially proposed as the approach to the
text clustering by Jo and Japkowicz, in 2005 [2]. It was
mentioned as an innovative neural network by Zheng et al.
in 2006 [4]. The progress of the research on the NTSO was
finalized by its complete validation in the real experiments 


on the text clustering, in 2010 [16]. In the above literatures, 
we presented the initial proposal and the complete validation 
of the NTSO. 

In this research, texts are encoded into graphs, instead of
string vectors. In the above literature, texts are encoded into
string vectors as another way of avoiding the problems in
encoding texts into numerical vectors. It takes much
time to build the similarity matrix from a corpus as
the basis for computing the semantic operations on string
vectors. The semantic similarities among words depend
strongly on the corpus; the semantic similarity between two
words may be different, depending on the corpus. It is
necessary to define and characterize mathematically more 
semantic operations for modifying other machine learning 
algorithms into their string vector based versions. 


III. PROPOSED APPROACH 


This section is concerned with encoding texts into
graphs, modifying the KNN (K Nearest Neighbor) into the
graph based version and applying it to the text categorization,
and consists of the four sections. In Section III-A,
we deal with the process of encoding texts into graphs. In
Section III-B, we describe formally the process of computing
the similarity between two graphs. In Section III-C, we describe
the graph based KNN version as the approach to the
text categorization. In Section III-D, we present the system
architecture and the execution flow of the proposed system.


A. Text Encoding 


This section is concerned with the process of encoding a 
text into a graph. The graph is defined in the context of the 
data structure as the two sets: the vertex set and the edge 
set. The words in the graph which represents a text are given 
as vertices. A semantic similarity between words is given as 
an edge, and computed based on collocations of words in 
a corpus. This section is intended to describe the steps of 
encoding a text into a graph, in detail. 

The process of indexing a text into a list of words as 
vertices is illustrated in Figure 1. In representing a text into 
a graph, words are defined as vertices. A single text is given 
as the input in the left side in Figure 1, and N words are 
given as the results from indexing the text in the right side. 
The basic steps of indexing a text for generating vertices are 
the tokenization, the stemming, and the stopword removal. 
The vertex set is constructed in this step for constructing a 
graph from the input text. 
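
The indexing steps named above can be sketched as follows. This is only a minimal illustration under our own assumptions: the stopword list `STOPWORDS` and the suffix-stripping function `naive_stem` are hypothetical stand-ins, and a real system would presumably use a full stopword list and an established stemmer.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to", "for"}


def naive_stem(token: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    if token.endswith("ss"):
        return token
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_text(text: str) -> list[str]:
    """Tokenize, stem, and remove stopwords; return distinct words in first-seen order."""
    tokens = re.findall(r"[a-z]+", text.lower())  # tokenization
    vertices = []
    for token in tokens:
        stem = naive_stem(token)                  # stemming
        if stem in STOPWORDS:                     # stopword removal
            continue
        if stem not in vertices:
            vertices.append(stem)
    return vertices


vertices = index_text("The systems are processing information for the business systems")
print(vertices)
```

The returned list plays the role of the vertex set; duplicates are dropped because a word appears as a vertex only once.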

Figure 1. Vertex Definition

The definition of edges in the graph which represents a
text is illustrated in Figure 2. The N words were already
generated by the process which is illustrated in Figure 1.
All possible pairs are generated from the N words and the
semantic similarity is computed for each pair by the equation
which is presented in Figure 2. The similarity between two
words is always given as a normalized value between zero
and one. We need only some edges with higher similarities,
instead of all complete edges for building a graph.

Figure 2. Edge Definition


The graph which represents a text is illustrated in Figure 3, 
as a simple example. The four words, information, computer, 
business, and system, are given as the vertices of the graph. 
The edges in Figure 3 are given as the complete edges, and 
the weight of each edge indicates the similarity between 
words as vertices. The similarity between vertices becomes 
an edge identifier of the graph. A corpus is needed for 
computing the similarity between words, based on their 
collocations. 

Figure 3. Graph representing a Text

Let us make some remarks on the process of encoding
a text into a graph. The graph is defined formally as the
two sets: the vertex set and the edge set. In the graph 
which represents a text, its vertices are given as words, and 
its edges are given as the similarities among words. The 
similarity which weights each edge is computed based on 
the collocations of words in texts. In this research, each 
graph is represented into an edge set, in the implementation 
level. 
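
The encoding of a text into an edge list may be sketched as below. The equation of Figure 2 is not reproduced in the text, so the function `word_similarity` uses an assumed collocation measure (a Dice coefficient over the corpus texts which contain each word); it is a placeholder for the actual equation, and the names `encode_graph` and `threshold` are also our own.

```python
from itertools import combinations


def word_similarity(w1, w2, corpus):
    """ASSUMED measure, not the paper's equation: Dice coefficient over
    the sets of corpus texts in which each word occurs."""
    d1 = {i for i, doc in enumerate(corpus) if w1 in doc}
    d2 = {i for i, doc in enumerate(corpus) if w2 in doc}
    if not d1 and not d2:
        return 0.0
    return 2 * len(d1 & d2) / (len(d1) + len(d2))


def encode_graph(words, corpus, threshold=0.0):
    """Return the graph as a list of (vertex, vertex, weight) edges,
    keeping only the pairs whose weight exceeds the threshold."""
    edges = []
    for w1, w2 in combinations(words, 2):
        weight = word_similarity(w1, w2, corpus)
        if weight > threshold:
            edges.append((w1, w2, weight))
    return edges


# Toy corpus: each text is given as the set of its indexed words.
corpus = [
    {"information", "system", "computer"},
    {"business", "information"},
    {"computer", "system"},
]
graph = encode_graph(["information", "computer", "business", "system"], corpus)
for edge in graph:
    print(edge)
```

Raising `threshold` keeps only the edges with higher similarities, which corresponds to the remark that not all complete edges are needed.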


B. Similarity Metric 


This section is concerned with the computation of simi- 
larity between graphs. A graph is represented into a set of 
edges in the implementation level. The similarity between 
edges is computed and it is expanded into one between 
two graphs. The similarity between two graphs is always 
given as a normalized value between zero and one, and 
proportional to the shared edges between two graphs. This 
section is intended to describe the similarity metric between 
two graphs which is proposed in this research. 

The three cases which are considered in computing a
similarity between two edges are illustrated in Figure 4, and
the two edges are defined as the entries, each of which
consists of its two vertices and its weight, as shown in
equation (1),

\[ e_1 = (v_{11}, v_{12}, w_1), \quad e_2 = (v_{21}, v_{22}, w_2) \tag{1} \]


If the two vertices are the same in the two edges
as shown in the left of Figure 4, the two edge weights
are averaged as the similarity between edges, as shown in
equation (2),

\[ \text{if } ((v_{11} = v_{21}) \wedge (v_{12} = v_{22})) \vee ((v_{11} = v_{22}) \wedge (v_{12} = v_{21})) \text{ then } sim(e_1, e_2) = \frac{1}{2}(w_1 + w_2) \tag{2} \]


If only one of the two vertices is shared by the two
edges, as shown in the middle of Figure 4, the product of
the two weights is the similarity between edges, as shown in
equation (3),

\[ \begin{aligned} \text{if } & ((v_{11} = v_{21}) \wedge (v_{12} \neq v_{22})) \vee ((v_{11} = v_{22}) \wedge (v_{12} \neq v_{21})) \\ & \vee ((v_{11} \neq v_{21}) \wedge (v_{12} = v_{22})) \vee ((v_{11} \neq v_{22}) \wedge (v_{12} = v_{21})) \\ \text{then } & sim(e_1, e_2) = w_1 \cdot w_2 \end{aligned} \tag{3} \]

If no vertex is shared by the two edges, as shown in
the right of Figure 4, the similarity between the edges
becomes zero, as shown in equation (4),

\[ \text{if } ((v_{11} \neq v_{21}) \wedge (v_{12} \neq v_{22})) \wedge ((v_{11} \neq v_{22}) \wedge (v_{12} \neq v_{21})) \text{ then } sim(e_1, e_2) = 0 \tag{4} \]

In computing the similarity between the two edges, it is
assumed that the weight which is assigned to each edge is
always given as a normalized value between zero and one.
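
The three cases of equations (2), (3), and (4) can be sketched as a single function. The tuple layout `(vertex, vertex, weight)` and the function name `edge_similarity` are our own assumptions, and the sketch further assumes that each edge joins two distinct vertices.

```python
def edge_similarity(e1, e2):
    """Similarity between two undirected weighted edges.

    Each edge is assumed to be a (vertex, vertex, weight) tuple with a weight
    in [0, 1] and two distinct vertices.
    """
    (v11, v12, w1), (v21, v22, w2) = e1, e2
    shared = len({v11, v12} & {v21, v22})  # number of shared vertices
    if shared == 2:   # both vertices shared: average of the weights
        return (w1 + w2) / 2
    if shared == 1:   # exactly one vertex shared: product of the weights
        return w1 * w2
    return 0.0        # no vertex shared


print(edge_similarity(("a", "b", 0.5), ("a", "c", 0.5)))
```

Comparing the vertex sets rather than the vertex positions treats the edges as undirected, so (a, b) and (b, a) fall into the first case.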


Figure 4. Three Cases in computing Edge Similarity 


Let us compute the similarity between an edge and a
graph by expanding one between edges. The similarity
between two edges, $sim(e_1, e_2)$, is computed by the above
process, and the similarity between an edge and a graph,
$sim(e_1, G_2)$, where $G_2 = \{e_{21}, e_{22}, \ldots, e_{2|G_2|}\}$, is
computed now. The maximum of the similarities of the edge, $e_1$,
with the edges of the graph, $G_2$, is the similarity, $sim(e_1, G_2)$,
as expressed by equation (5),

\[ sim(e_1, G_2) = \max_{i=1}^{|G_2|} sim(e_1, e_{2i}) \tag{5} \]

$e_{max}$ is the edge of the graph, $G_2$, which satisfies equation
(6), as the one most similar to the edge, $e_1$,

\[ \max_{i=1}^{|G_2|} sim(e_1, e_{2i}) = sim(e_1, e_{max}) \tag{6} \]

We need to remove the edges with no vertex which is shared
by the edge, $e_1$, in the graph, $G_2$, in advance, for the more
efficient computation.

Let us compute the similarity between two graphs by
expanding one between an edge and a graph. The two
graphs, $G_1$ and $G_2$, are expressed respectively into
the two sets, $G_1 = \{e_{11}, e_{12}, \ldots, e_{1|G_1|}\}$ and
$G_2 = \{e_{21}, e_{22}, \ldots, e_{2|G_2|}\}$. The similarity between
$G_1$ and $G_2$ is computed by equation (7),

\[ sim(G_1, G_2) = \frac{1}{|G_1|} \sum_{i=1}^{|G_1|} sim(e_{1i}, G_2) \tag{7} \]

The similarity between two graphs is always a normalized
value between zero and one, as shown in equation (8),

\[ 0 \leq sim(G_1, G_2) \leq 1 \tag{8} \]

The similarity metric which is expressed in equation (7), is
used for modifying the KNN algorithm into the graph based
version as the approach to the text categorization.
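
Equations (5) and (7) can be sketched as follows, with a graph given as a list of `(vertex, vertex, weight)` edges. The function names are our own assumptions, and returning zero for an empty graph is a convention which we add, since equation (7) is undefined in that case.

```python
def edge_similarity(e1, e2):
    """Three-case edge similarity: average, product, or zero."""
    (v11, v12, w1), (v21, v22, w2) = e1, e2
    shared = len({v11, v12} & {v21, v22})
    if shared == 2:
        return (w1 + w2) / 2
    if shared == 1:
        return w1 * w2
    return 0.0


def edge_graph_similarity(e1, g2):
    """Equation (5): maximum similarity of the edge e1 with the edges of g2."""
    return max((edge_similarity(e1, e2) for e2 in g2), default=0.0)


def graph_similarity(g1, g2):
    """Equation (7): average of the edge-to-graph similarities over g1."""
    if not g1:
        return 0.0  # assumed convention for an empty graph
    return sum(edge_graph_similarity(e, g2) for e in g1) / len(g1)


g1 = [("information", "computer", 0.5), ("computer", "system", 1.0)]
g2 = [("computer", "system", 0.5), ("business", "system", 0.25)]
print(graph_similarity(g1, g2))  # (0.25 + 0.75) / 2 = 0.5
```

Because the average in equation (7) runs over the edges of the first graph only, the value is not symmetric in general; swapping the arguments may give a different result when the two graphs differ in size.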

Let us make some remarks on the similarity metric 
between two graphs which is covered in this section. The 
similarity between two edges is computed, considering the 
three cases. The maximum of the similarities of an edge 
with ones in a graph is the similarity between an edge 
and a graph. Average over the similarities of edges of first 
graph with ones in the second graph becomes the similarity 
between two graphs. The similarity metric between two 
graphs is utilized for modifying the KNN algorithm into 
the graph based version which processes graphs directly. 


C. Proposed Version of KNN 


This section is concerned with the graph based KNN 
algorithm as the approach to the text classification. In the 
previous section, we described the similarity metric between 
two graphs which is used for modifying the KNN algorithm 
into the proposed version. A novice text is encoded into 
a graph, and its similarities with the training graphs are 
computed, using the similarity metric. Like the traditional 
version of the KNN, the labels of nearest neighbors are voted 
for deciding one of the novice one. This section is intended 
to describe the modified version of the KNN algorithm, as 
an approach to the text classification. 

Figure 5 illustrates that the similarities of a novice
graph with the sample graphs are computed for selecting
nearest neighbors. A novice text is encoded into the
graph, $G_{nov}$, the predefined categories are notated by
$C = \{c_1, c_2, \ldots, c_{|C|}\}$, and the training set which consists
of $n$ sample graphs which represent the sample texts
is notated by $Tr = \{(G_1, y_1), (G_2, y_2), \ldots, (G_n, y_n)\}$,
where $G_i$ is a sample graph, and $y_i \in C$. The similarities
of the novice graph, $G_{nov}$, with the sample graphs,
$G_1, G_2, \ldots, G_n$, are computed by equation (7), as
$sim(G_{nov}, G_1), sim(G_{nov}, G_2), \ldots, sim(G_{nov}, G_n)$
in the proposed KNN algorithm. The similarity
between the novice graph, $G_{nov}$, and a sample
graph, is given as a normalized value between zero
and one, as shown in equation (8). The similarities,
$sim(G_{nov}, G_1), sim(G_{nov}, G_2), \ldots, sim(G_{nov}, G_n)$, are
ranked by their values for selecting nearest neighbors.

Figure 5. Similarities of a Novice Graph with Sample Ones

The process of selecting nearest neighbors after
computing their similarities with the novice item
is illustrated in Figure 6. The similarities which
are computed by equation (7) are ranked into ones,
$sim(G_{nov}, G'_1), sim(G_{nov}, G'_2), \ldots, sim(G_{nov}, G'_n)$. The
$K$ items with their highest similarities with the novice
item are selected as its nearest neighbors, as expressed in
equation (9),

\[ Near(K, G_{nov}) = \{G'_1, G'_2, \ldots, G'_K\}, \quad K \ll n \tag{9} \]

As an alternative way, we may consider selecting items
with their higher similarities than a given threshold. We
use the nearest neighbors, $G'_1, G'_2, \ldots, G'_K$, from the training
examples, for deciding the label of the novice graph, $G_{nov}$.

Figure 6. Selection of Nearest Neighbors from Training Examples
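
The selection of nearest neighbors in equation (9), together with the threshold based alternative, may be sketched as below. The sketch assumes that the similarities with the novice graph were already computed and are given as a plain list indexed by training example; the function names are hypothetical.

```python
def select_nearest(similarities, k):
    """Return the indices of the k training graphs most similar to the novice graph."""
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return ranked[:k]


def select_above(similarities, threshold):
    """Alternative: return the indices whose similarity exceeds the threshold."""
    return [i for i, s in enumerate(similarities) if s > threshold]


sims = [0.2, 0.9, 0.5, 0.7]     # similarities of the novice graph with four samples
print(select_nearest(sims, 2))  # [1, 3]
print(select_above(sims, 0.4))  # [1, 2, 3]
```

With the fixed K, the number of neighbors is constant for every novice item, whereas the threshold variant returns a variable number of neighbors and may return none.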


The process of voting the labels of the nearest neighbors
for deciding the label of the novice item is illustrated
in Figure 7. The nearest neighbors are selected by the
process which is illustrated in Figure 6, as a set,
$Ne = \{G'_1, G'_2, \ldots, G'_K\}$, and the function for weighting a nearest
neighbor by a category is defined as equation (10),

\[ w(c_i, G'_j) = \begin{cases} 1 & \text{if } G'_j \in c_i \\ 0 & \text{otherwise} \end{cases} \tag{10} \]

For each category, the number of nearest neighbors which
belong to it is counted as shown in equation (11),

\[ Count(c_i, Ne) = \sum_{j=1}^{K} w(c_i, G'_j) \tag{11} \]

The label of a novice item is decided by the label with
the majority of the nearest neighbors, $c_{max}$, as shown in
equation (12),

\[ c_{max} = \operatorname*{argmax}_{i=1}^{|C|} Count(c_i, Ne) \tag{12} \]

The function, $w(c_i, G'_j)$, may be expanded into
$w(c_i, G'_j, G_{nov})$ by augmenting the novice item, if
the weight is dependent on the distance between the nearest
neighbor and the novice item.

Figure 7. Voting Labels of Training Examples for deciding One of Novice
Example
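
The voting in equations (11) and (12) may be sketched as follows, given only the labels of the selected nearest neighbors. The second function illustrates one possible expansion of the weighting function, where each vote is weighted by the similarity of the neighbor with the novice item; this particular weighting is our assumption, since the expanded form is not specified.

```python
from collections import Counter


def vote(neighbor_labels):
    """Equations (11) and (12): count the neighbors per category, take the majority."""
    counts = Counter(neighbor_labels)
    # On a tie, Counter.most_common keeps the label which was encountered first.
    return counts.most_common(1)[0][0]


def weighted_vote(neighbor_labels, neighbor_sims):
    """ASSUMED variant: each neighbor votes with its similarity to the novice item."""
    totals = {}
    for label, sim in zip(neighbor_labels, neighbor_sims):
        totals[label] = totals.get(label, 0.0) + sim
    return max(totals, key=totals.get)


labels = ["sports", "business", "sports"]
print(vote(labels))                            # sports
print(weighted_vote(labels, [0.2, 0.9, 0.3]))  # business
```

The two functions may disagree, as in this example, when a single very similar neighbor outweighs several less similar ones.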


Let us make some remarks on the graph based KNN 
algorithm as the approach to the text categorization. In using 
the version, it is assumed that the sample texts and a novice 
text are encoded into graphs. The similarities of a novice 
graph with the sample graphs are computed by the similarity
metric which is described in Section III-B. The sample 
graphs are ranked by their similarities with the novice graph, 
and the K sample graphs with their highest similarities are 
selected as its nearest neighbors. The labels of the nearest 
neighbors are voted for deciding the label of the novice one. 


D. Text Categorization System 


This section is concerned with the system architecture 
and the execution flow of the text categorization system. 
The KNN algorithm which processes graphs directly is 
adopted as the approach to the text categorization, and 
was already described in Section III-C. In this system, a 
novice text is encoded into a graph, and classified by the 
KNN algorithm. We present the system architecture and the 
execution process in the step of designing the system, and 
consider its implementation in Java or Python in the next 
research. This section is intended to describe the sampling 
process, the system architecture, and the execution process 
of the system. 

In Figure 8, gathering texts as samples for each topic 
is illustrated. The M topics are predefined as a list under 
the assumption that the text categorization belongs to the
flat classification. Texts are gathered and allocated to each 
topic, and encoded into graphs by the process which was 
described in Section III-A. The M groups of graphs are given 
as the training set in the system, as shown in the bottom 
of Figure 8. The hierarchical text categorization where the 
categories are predefined as a tree will be considered in the 
next research. 

The system architecture of the text classification system 
is illustrated in Figure 9. In the encoding module, texts are 
encoded into graphs by the process which is described in 
Section III-A. The role of the similarity computation module 
is to compute the similarities of a graph which represents
a novice text with the ones which represent the sample
texts, and to select some with their highest similarities as
the nearest neighbors. The role of the voting module is to
decide the label of the novice text by voting ones of the
nearest neighbors. The role of the proposed system is to 
classify unlabeled texts. 

The execution process of the text classification system is 
illustrated in Figure 10. The similarity matrix is constructed 
for defining edges from the sample texts, and both the 
sample texts and a novice one are encoded into graphs. The 
similarity between graphs is computed as one between a 
novice and a sample in the execution of the KNN algorithm. 
The category of the novice one is decided by voting ones 
of the nearest neighbors. The category of the novice item is 
generated as the output of the system. 

Let us make some remarks on the system architecture and
the execution process of the text categorization system which
are presented in Figures 9 and 10. Encoding of texts into
graphs and the similarity between them are proposed in this
research. The KNN algorithm is modified into the graph
based version by defining the similarity between graphs as
one between a novice text and a sample text. This research
provides the system architecture and the execution flow
which are needed for doing the general design. In the next
research, we consider the detail design and the source code
for implementing the system.

Figure 9. System Architecture


IV. EXPERIMENTS 


This section is concerned with the empirical experiments
for validating the proposed version of KNN, and consists of
four subsections. In Section IV-A, we present the results
from applying the proposed version of KNN to the text
categorization on the collection, NewsPage.com. In Section
IV-B, we show the results from applying it for categorizing
texts from the collection, Opinosis. In Sections IV-C and
IV-D, we present the results from comparing the two
versions of KNN with each other in categorizing texts from
the general and specific versions of 20NewsGroups,
respectively.


Figure 10. Execution Process


A. NewsPage.com 


This section is concerned with the experiments for
validating the better performance of the proposed version
on the collection, NewsPage.com. Four categories are
predefined in this collection, and texts are gathered from the
collection category by category as labeled ones. Each text is
classified exclusively into one of the four categories. In this
set of experiments, we apply the traditional and proposed
versions of KNN to the classification task, without
decomposing it into binary classifications, and use the
accuracy as the evaluation measure. In this section, we
observe the performance of both versions of KNN while
changing the input size.

In Table I, we specify the text collection, NewsPage.com, 
which is used in this set of experiments. This text collection 
was used for evaluating approaches to text categorization in 
previous works [19]. In the collection, the four categories 
are predefined: Business, Health, Internet, and Sports, and 
375 texts are selected at random in each category. In each 
category, the set of 375 texts is partitioned into the 300 
texts as training ones and the 75 texts as test ones. The text 
collection was built by copying and pasting individual news 
articles from the web site, newspage.com, in 2005, as plain 
text files whose extension is ‘txt’. 


Table I
THE NUMBER OF TEXTS IN NEWSPAGE.COM
Category | #Texts | #Training Texts | #Test Texts
Business | 500 | 300 | 75
Health | 500 | 300 | 75
Internet | 500 | 300 | 75
Sports | 500 | 300 | 75
Total | 2000 | 1200 | 300
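
The per-category sampling summarized in Table I, where 375 of
the 500 texts are selected at random and partitioned into 300
training texts and 75 test texts, can be sketched as below. The
function name, the fixed seed, and the placeholder file names
are assumptions for illustration; the original experiments do
not state how the random selection was implemented.

```python
import random

def split_category(texts, n_select=375, n_train=300, seed=0):
    """Randomly select n_select texts and split them into two lists.

    Returns (training texts, test texts). The seed is an assumption
    which only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    selected = rng.sample(texts, n_select)  # sampling without replacement
    return selected[:n_train], selected[n_train:]

# Placeholder file names standing in for the 500 texts of one category.
business = [f"business_{i:03d}.txt" for i in range(500)]
train, test = split_category(business)
print(len(train), len(test))  # 300 75
```

Repeating the call for each of the four categories yields the
1200 training texts and the 300 test texts in the last row of
Table I.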


Let us mention the experimental process for empirically
validating the proposed approach to the task of text
categorization. In this collection, the texts are labeled with
one of the four categories which are presented in Table I,
and they are encoded into numerical vectors and graphs. For
each test example, the KNN computes its similarities with
the 1200 training examples and selects the three most
similar training examples as its nearest neighbors. Each of
the 300 test examples is classified into one of the four
categories, Business, Sports, Internet, and Health, by voting
over the labels of its nearest neighbors. For evaluating both
versions of the KNN algorithm, we compute the classification
accuracy by dividing the number of correctly classified test
examples by the number of test examples.
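
The evaluation loop can be sketched for the traditional version
as follows, with k = 3 and the accuracy computed as the number
of correctly classified test examples divided by the number of
test examples. The cosine similarity between numerical vectors
is assumed here as the similarity of the traditional version,
since this section does not name it, and the toy vectors are
made up for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two numerical vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def knn_predict(vector, training, k=3):
    """Vote over the labels of the k most similar training examples."""
    ranked = sorted(training,
                    key=lambda ex: cosine(vector, ex[0]),
                    reverse=True)
    labels = [label for _, label in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]

def accuracy(test, training, k=3):
    """Correctly classified test examples divided by all test examples."""
    correct = sum(1 for vector, label in test
                  if knn_predict(vector, training, k) == label)
    return correct / len(test)

# Toy data: (vector, label) pairs.
training = [
    ([3, 0, 1], "Business"), ([2, 1, 0], "Business"), ([4, 0, 0], "Business"),
    ([0, 3, 1], "Sports"), ([1, 2, 0], "Sports"), ([0, 4, 0], "Sports"),
]
test = [
    ([5, 1, 0], "Business"),
    ([0, 2, 1], "Sports"),
    ([1, 3, 0], "Business"),  # lies closer to the Sports examples
    ([4, 1, 1], "Business"),
]
print(accuracy(test, training))  # 0.75
```

The same accuracy function applies to the proposed version once
the vectors are replaced by graphs and the cosine similarity by
a similarity between graphs.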

In Figure 11, we illustrate the experimental results from
categorizing texts using both versions of the KNN algorithm.
The y-axis indicates the accuracy, which is the rate of
correctly classified examples in the test set. On the x-axis,
each group indicates the input size, which is the dimension
of the numerical vectors which represent texts. In each
group, the gray bar and the black bar indicate the
achievements of the traditional version and the proposed
version of the KNN algorithm, respectively. On the x-axis,
the rightmost group indicates the average over the
accuracies of the left groups.

Let us discuss the results from the text categorization
using both versions of the KNN algorithm, as shown in
Figure 11. The accuracy, which is the performance measure
of the classification task, ranges between 0.35 and 0.52. The
proposed version of the KNN algorithm works clearly better
in three input sizes: 50, 100, and 200. It loses in the
input size 10. In spite of this loss, we conclude from this
set of experiments that the proposed version works clearly
better than the traditional one when averaging over the four
cases.


B. Opinosis


This section is concerned with the set of experiments for
validating the better performance of the proposed version on
the collection, Opinosis. Three categories are predefined
in the collection, and labeled texts are prepared from it. Each
text is classified exclusively into one of the three categories.
We do not decompose the given classification into binary
classifications, and we use the accuracy as the evaluation
measure. In this section, we observe the performances
of both versions of the KNN algorithm with the different
input sizes.

In Table II, we specify the text collection, Opinosis,
which is used in this set of experiments. The collection
was used in previous works for evaluating approaches to
text categorization. The three categories, ‘Car’,
‘Electronics’, and ‘Hotel’, are predefined, and all texts are
used in this set of experiments. In each category, we use
six texts as the test set and the remaining ones as the
training set, as shown in Table II. We obtained the
collection by downloading it from the web site,
http://archive.ics.uci.edu/ml/machine-learning-databases/opinion/.


Table II
THE NUMBER OF TEXTS IN OPINOSIS
Category | #Texts | #Training Texts | #Test Texts
Car | 23 | 17 | 6
Electronic | 16 | 10 | 6
Hotel | 12 | 6 | 6
Total | 51 | 33 | 18


We perform this set of experiments by the process which
is described in Section IV-A. We use all of the 51 texts,
each of which is labeled with one of the three categories,
and encode them into numerical vectors and graphs with the
input sizes 10, 50, 100, and 200. For each test example, both
versions of KNN compute its similarities with the 33
training examples and select the three most similar training
examples as its nearest neighbors. Each of the 18 test
examples is classified into one of the three categories by
voting over the labels of its nearest neighbors. For
evaluating both versions of the KNN algorithm, the
classification accuracy is computed by dividing the number of
correctly classified test examples by the number of test
examples.

In Figure 12, we illustrate the experimental results from
categorizing texts using both versions of the KNN algorithm.
As in Figure 11, the y-axis indicates the accuracy,
and the x-axis groups the two versions by
input size. In each group, the gray bar and the black bar
indicate the achievements of the traditional version and the
proposed version of the KNN algorithm, respectively. In
Figure 12, the rightmost group indicates the averages over
the results of the left four groups. Figure 12 thus presents
the results from classifying each text into one of the three
categories by both versions on the text collection,
Opinosis.

We discuss the results from the text categorization
using both versions of the KNN algorithm on Opinosis,
shown in Figure 12. The accuracy values of both
versions range between 0.55 and 1.0. The proposed version
works better than the traditional one in three input sizes:
50, 100, and 200. It shows perfect results in the input
size 100. From this set of experiments, we conclude that
the proposed version works better than the traditional one
when averaging over the four cases.


Figure 12. Results from Classifying Texts in Text Collection: Opinosis


C. 20NewsGroups I: General Version 


This section is concerned with one more set of experiments
for validating the better performance of the proposed
version on the text collection, 20NewsGroups I. In this set
of experiments, we predefine four general categories in
this collection, and gather texts from it category by category
as the classified ones. Each text is classified exclusively into
one of the four categories. We apply the KNN algorithms
directly to the given task without decomposing it into binary
classifications, and use the accuracy as the evaluation
measure. In this section, we observe the performances
of both versions with the different input sizes.

In Table III, we specify the general version of
20NewsGroups which is used for evaluating the two
versions of the KNN algorithm. In 20NewsGroups, the
hierarchical classification system is defined with two
levels; in the first level, the seven categories, alt, comp,
rec, sci, talk, misc, and soc, are defined, and among them,
four categories are selected, as shown in Table III.
In each category, we select 375 texts at random from 4000 or
5000 texts. The 375 texts are partitioned into the
300 texts in the training set and the 75 texts in the test
set, as shown in Table III. We obtained the collection,
20NewsGroups, by downloading it from the web site,
https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html,
as one of the standard text collections for evaluating
approaches to text categorization.

Table III
THE NUMBER OF TEXTS IN 20NEWSGROUPS I
Category | #Texts | #Training Texts | #Test Texts
Comp | 5000 | 300 | 75
Rec | 4000 | 300 | 75
Sci | 4000 | 300 | 75
Talk | 4000 | 300 | 75
Total | 17000 | 1200 | 300


The experimental process is identical to that in the
previous sets of experiments. In each category, we select
375 texts at random and encode them into numerical vectors
and graphs with the input sizes 10, 50, 100, and 200. For
each test example, we compute its similarities with the 1200
training examples, and select the three most similar ones as
its nearest neighbors. The two versions of the KNN algorithm
classify each of the 300 test examples into one of the four
categories, comp, rec, sci, and talk, by voting over the
labels of its nearest neighbors. We also use the
classification accuracy as the evaluation measure in this set
of experiments.

In Figure 13, we illustrate the experimental results from
classifying the texts into one of the four topics on the broad
version of 20NewsGroups. Figure 13 presents the results in
the same frame as Figures 11 and 12. In each group, the gray
bar and the black bar indicate the achievements of the
traditional version and the proposed version of the KNN
algorithm, respectively. Note that, in this set of
experiments, the task is not decomposed into binary
classifications.

Let us discuss the results from classifying the texts into
one of the four categories using both versions of the KNN
algorithm on the broad version of 20NewsGroups, as shown in
Figure 13. The accuracies of both versions range
between 0.45 and 0.7. The proposed version shows better
performance in two of the four input sizes, and remains
competitive with the traditional one in the others. From
this set of experiments, we conclude that the proposed
version wins over the traditional one when averaging the
achievements over the four input sizes.


D. 20NewsGroups II: Specific Version 


This section is concerned with one more set of experiments
where the better performance of the proposed version
is validated on another version of 20NewsGroups. In this set
of experiments, four specific categories are predefined in
this collection. Each text is exclusively classified into one
of the four categories, like in the previous sets of
experiments. We apply the two versions of the KNN algorithm
directly to the classification task, without decomposing it
into binary classifications, and use the accuracy as the
evaluation metric. In this section, we observe the
performances of both versions of the KNN algorithm with the
different input sizes.

In Table IV, we specify the specific version of
20NewsGroups which is used as the test collection in this set
of experiments. Within the general category, sci, we predefine
the four categories: ‘electro’, ‘medicine’, ‘script’, and
‘space’. In each category, we select 375 texts at random
among approximately 1000 texts. In each category, the set of
375 texts is partitioned into the training set of 300 texts
and the test set of 75 texts, like in the previous set of
experiments. The task in the set of experiments in Section
IV-C is a broad classification, whereas that in this set of
experiments is a specific classification.


Table IV
THE NUMBER OF TEXTS IN 20NEWSGROUPS II
Category | #Texts | #Training Texts | #Test Texts
Electro | 1000 | 300 | 75
Medicine | 1000 | 300 | 75
Script | 1000 | 300 | 75
Space | 1000 | 300 | 75
Total | 4000 | 1200 | 300


The process of this set of experiments is the same
as that in the previous sets of experiments. We select a
balanced number of texts from the collection over categories,
and encode them into the representations with the input
sizes which are identical to those in the previous set of
experiments. Using the two versions of the KNN algorithm
for their comparison, we classify each text in the test set
into one of the four specific categories within the general
category, ‘sci’: ‘electro’, ‘medicine’, ‘script’, and ‘space’.
We use the accuracy as the evaluation metric, like in the
previous set of experiments.

In Figure 14, we present the experimental results from
classifying the texts using both versions of the KNN
algorithm on the specific version of 20NewsGroups. The frame
of illustrating the classification results is identical to
the previous ones. In each group, the gray bar and the black
bar stand for the achievements of the traditional version and
the proposed version, respectively. The y-axis in Figure 14
indicates the classification accuracy which is used as the
performance metric. The texts are classified directly into
one of the four categories, like in the previous sets of
experiments.

Let us discuss the results from classifying the texts on
the specific version of 20NewsGroups, as shown in Figure
14. The accuracies of both versions range between 0.4
and 0.8. The proposed version shows better performance
in all of the four input sizes. The performance of both
versions is correlated with the input size, as shown in Figure
14. From this set of experiments, we conclude that the
proposed version has outstandingly better performance
when averaging over the accuracies of the four input sizes.


V. CONCLUSION 


Let us discuss the overall results from classifying texts
using the two versions of the KNN algorithm. The two versions
are compared with each other in the task of text
categorization in these sets of experiments. The proposed
version shows better results in all of the four collections.
The accuracies of the traditional version range between 0.35
and 0.81, while those of the proposed version range between
0.49 and 1.0. From the four sets of experiments, we conclude
that the proposed version improves the text categorization
performance, as the contribution of this research.

Let us mention the remaining tasks for further
research. We will apply and validate the proposed approach in
classifying technical documents in specific domains such
as medicine or engineering, rather than news articles in
various domains. We will define and mathematically
characterize more advanced operations on graphs which
represent texts. We will modify more advanced machine
learning algorithms into their graph based versions, using
the more sophisticated operations. We will implement the text
categorization system as a system module or as independent
software by adopting the proposed approach.